---
layout: post
title: Coronary Artery Disease Diagnosis
description: .
tags:
- Logistic Function
- Logistic Regression
- Machine Learning
- Cross-Entropy
- Classification
- Gradient Descent
- Neural Networks
- Notebook
---

Coronary Artery Disease Diagnosis

Data Dictionary

Using CRISP-DM to improve business outcomes

Introduction

This blog focuses on the analysis of a business problem by means of following the CRISP-DM process as described by the following process flow:

CRISP-DM

The CRISP-DM (CRoss-Industry Standard Process for Data Mining) methodology provides essential support for those seeking to understand and practise data mining/data science. It is a business-focused methodology which keeps the analysis centred on what matters, i.e. the business problem and the data, rather than on technique. More on the CRISP-DM process can be found here:

CRISP-DM

Tom Khabaza, one of the authors of CRISP-DM, has been practising data science for decades and helped draft the methodology in 1999! It has been used by many since and is thoroughly worth studying in depth, as it provides many pearls of wisdom on data science and business analysis.

Although the blog will outline the business analysis, and hence will not focus on code, the code for the analysis can be accessed by expanding the code sections in the blog for those interested. (In truth it is never possible to separate the two completely, so the code has been retained for reference purposes.)

Section 1: Business Understanding

The business area of concern for this analysis is the diagnosis of Coronary Artery Disease by medical practitioners. Currently, the Gold Standard for diagnosis is Angiography. The problem we are investigating is that in many settings too many Angiograms are being performed, which could result in poorer patient care and outcomes.

Coronary Artery Disease (CAD) is a disease in which there is a narrowing or blockage of the coronary arteries (blood vessels that carry blood and oxygen to the heart). Coronary heart disease is usually caused by atherosclerosis (a buildup of fatty material and plaque inside the coronary arteries).

Data Dictionary

We attempt to answer the following business questions by performing this analysis.

Question 1

  • Can data science be used to improve the diagnosis of Coronary Artery Disease by means of using existing data sources?

Question 2

  • Can data science be used to reduce the number of Angiograms performed in settings where this is problematic?

Question 3

  • What are the 4 factors most highly correlated with CAD within our dataset?

We will attempt to answer these questions by interrogating data available to us.

In [1]:
Section 2: Data Understanding

For this analysis we use the Cleveland "Coronary Artery Disease" dataset found on the UCI Machine Learning Repository at the following location:

Heart Disease Dataset

The dataset covers 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The data in the independent group are all available prior to an Angiogram taking place (routine, test and demographic data). The 13 independent/feature variables can be divided into 3 groups as follows:

Routine evaluation (based on historical data):

  • ECG at rest
  • Serum Cholesterol
  • Fasting blood sugar

Non-invasive test data (informed consent obtained for data as part of research protocol):

  • Exercise ECG
    • ST-segment peak slope (upsloping, flat or downsloping)
    • ST-segment depression
  • Exercise Thallium scintigraphy (fixed, reversible or none)
  • Cardiac fluoroscopy (number of vessels appeared to contain calcium)

Other demographic and clinical variables (based on routine data):

  • Age
  • Sex
  • Chest pain type
  • Systolic blood pressure
  • ST-T-wave abnormality (T-wave abnormality)
  • Probable or definite ventricular hypertrophy (Estes' criteria)

Approach

As a starting hypothesis, first prize would be to create a predictive model for Coronary Artery Disease based on information available prior to an Angiogram taking place. Second prize would be to identify factors associated with Coronary Artery Disease, as indicated by a coronary angiogram interpreted by a Cardiologist, which could be used to create a clinical algorithm to decide whether an Angiogram is indicated - again based on data available prior to an Angiogram taking place.

For our predictive model we will use as the dependent/response variable the angiographic test result indicating a >50% diameter narrowing, which indicates Coronary Artery Disease.

We will perform various Machine Learning techniques and measure model classification accuracy. Furthermore, we will analyse the significance of the various features, applying several techniques to rank feature importance.

We aim to achieve this by following the CRISP-DM pipeline approach of deploying a variety of ML techniques to build a predictive model and analyse its results. In the process we hope to gain valuable insights. The various steps in the process are as follows (not necessarily in this order):

  • Load data
  • Prepare data
    • Clean data
      • Missing values
      • Outliers
      • Erroneous values
    • Explore data
      • Exploratory data analysis (EDA)
      • Correlation analysis
      • Variable cluster analysis
    • Transform Data
      • Engineer features
      • Encode data
      • Scale & normalise data
      • Impute data
      • Feature selection/ importance analysis
  • Build model
    • Model selection
    • Data sampling (validation strategy, imbalanced classification)
    • Hyperparameter optimisation
  • Validate model
    • Accuracy testing
  • Analysis of results
    • Response curves
    • Accuracy analysis
    • Commentary

Let us start by importing the data.

In [2]:

By interrogating the data and documentation, we created a data dictionary for the data as follows:

Data Dictionary

The ca_disease variable indicates the level of artery blockage as follows:

Data Dictionary

We will use this variable as our dependent variable.

Let us explore the data!

Section 3: Data Preparation and Exploration
In [3]:

Our dataset is now ready for further analysis. We will now look at the distribution of variables and any possible outliers or heavy-tailed distributions. We start by looking at the number of unique values per variable, since variables with too little variability will be discarded from the analysis. We also define the levels of measurement for each variable here.
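As a concrete sketch of this check (the notebook cells are collapsed, so the DataFrame name `df` and the toy data here are assumptions), the per-variable unique counts come from pandas' `nunique`:

```python
import pandas as pd

# Hypothetical miniature of the cleaned dataset; the real notebook
# loads the full 303-row UCI file.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "cholesterol": [233, 250, 204, 236],
})

# Number of unique values per column: a column with a single level
# carries no information and would be dropped.
print(df.nunique())
```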

In [4]:
age                     41
sex                      2
chest_pain_type          4
rest_blood_press        50
cholesterol            152
fasting_blood_sugar      2
rest_ecg                 3
max_heart_rate          91
exer_ind_angina          2
st_depression           40
st_slope                 3
num_major_vessels        4
thallium_scint           3
ca_disease               5
dtype: int64

There are no columns with only one value. We therefore retain all columns for ML purposes, as there is enough variability to warrant using the data. There are many variables with fewer than 10 levels, which could be treated as categorical. Based on our initial assessment of the data we will work with the following levels of measurement:

  • age (continuous)
  • sex (binary)
  • chest_pain_type (ordinal)
  • rest_blood_press (continuous)
  • cholesterol (continuous)
  • fasting_blood_sugar (binary)
  • rest_ecg (ordinal)
  • max_heart_rate (continuous)
  • exer_ind_angina (binary)
  • st_depression (continuous)
  • st_slope (ordinal)
  • num_major_vessels (ordinal)
  • thallium_scint (ordinal - needs reordering)
  • ca_disease (binary - we will need to transform as there are actually 5 levels in the data)

At this point it seems as if the only nominal data is binary, which means we might not need any One Hot Encoding initially. We will leave the ordinal data as is for the initial analysis.

It is important to note the large number of binary and ordinal variables, which indicates that originally continuous variables have in all likelihood been discretised by the original authors. This could possibly be revisited at a later stage by considering the original data.

Next we look at the distribution of the data.

We now extract the data according to levels of measurement to ease analysis. We also rename the variables for ease of interpretation.

In [5]:

We now consider the response variable.

In [6]:
0    164
1     55
2     36
3     35
4     13
Name: ca_disease, dtype: int64

The spread of the data is good for classification, as there are a large number of positive cases. If one combines classes 1, 2, 3 and 4 as suggested, there will be a fairly even split between positive and negative outcomes. Let us confirm that this is the case.
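Combining levels 1-4 into a single positive class can be sketched as follows; the column name comes from the data dictionary, but the toy series is hypothetical since the notebook cell is collapsed:

```python
import pandas as pd

# Hypothetical miniature of the ca_disease column; the real data has
# 303 rows with levels 0-4.
ca = pd.Series([0, 0, 1, 2, 3, 4, 0], name="ca_disease")

# Level 0 means <50% narrowing (no disease); levels 1-4 all indicate
# disease, so anything above 0 becomes a positive outcome.
ca_binary = (ca > 0).astype(int)

print(ca_binary.value_counts())
```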

In [7]:
   counts         %
0     164  0.541254
1     139  0.458746

As expected, the distribution of positive and negative values is balanced, with 46% of records denoting a positive outcome. It is therefore very unlikely that we will need to make allowance for imbalanced classes (by resampling, boosting or using an alternative ML algorithm), as there is a sufficiently large proportion of positive outcomes. The sample size of 303 is however very small, so we will revisit this assumption once we have done some accuracy testing.

Let us consider the categorical variables.

In [8]:
In [9]:

The data looks fine from a modelling perspective as there are no variables with empty classes. We picked up from the data dictionary that the 'thallium_scint' variable needs to be recoded due to incorrect labelling i.e. the order does not result in an increasing ordinal value.

It is interesting to note that there are ~30% females and ~70% males. The distribution of chest pain also seems to increase in a roughly linear fashion for this population, with the largest portion of the population suffering from severe chest pain.

Next, let us consider the continuous variables. We start by looking at age patterns.

In [10]:

It is clear that individuals suffering from coronary artery disease have a higher average age.

In [11]:
Out[11]:
age rest_blood_press cholesterol max_heart_rate st_depression
count 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.438944 131.689769 246.693069 149.607261 1.039604
std 9.038662 17.599748 51.776918 22.875003 1.161075
min 29.000000 94.000000 126.000000 71.000000 0.000000
25% 48.000000 120.000000 211.000000 133.500000 0.000000
50% 56.000000 130.000000 241.000000 153.000000 0.800000
75% 61.000000 140.000000 275.000000 166.000000 1.600000
max 77.000000 200.000000 564.000000 202.000000 6.200000
In [12]:

The violin plots demonstrate that the distributions for age, maximum HR and ST depression differ between individuals with and those without ca disease, whereas there is little difference in the distributions for resting BP and cholesterol.

The violin plot for age against coronary artery disease demonstrates that the age of individuals without ca disease is evenly spread between the ages of 40 and 65, with some younger patients below the age of 30, whereas individuals with ca disease are mostly older, with a median age of approx 60 and few, if any, below 30 years of age.

The median maximum HR for individuals without ca disease is higher (~160) than for individuals with ca disease (~150), with a narrower distribution around the mean, whereas individuals with ca disease have a skewed distribution towards lower maximum HR, with a larger proportion having max HR below 100 than healthy individuals.

The distribution for ST depression is starkly different, with individuals without ca disease having a median ST depression of 0, with a narrow distribution around the mean, and a small proportion having ST depression between 1 and 2.

In contrast, individuals with ca disease follow a broader distribution around a median of ~1.5, with a substantial proportion of individuals with ST depression >2. Resting blood pressure and cholesterol do not appear to be significantly different between patients with and without ca disease, with both groups having similar median resting BP (around 125mmHg) and cholesterol (200-250) and roughly even spread around the point estimates. A small number of individuals with ca disease have much higher resting BP of >200, whereas none of those without ca disease have a resting BP >200. However, this may not be statistically significant. Interestingly, some individuals without ca disease have very high cholesterol (500-600).

In [13]:
In [14]:

Similar observations apply as for the density and violin plots. We are dealing with an older population here, with an average age of 54. There are a few outliers for high resting blood pressure, with the distribution showing a slight skew to the right; likewise for cholesterol and st_depression, which show even higher skewness. Conversely, max_heart_rate has outliers to the left and a slight left skew. This makes sense, as higher values of the former variables could indicate poorer health, whereas lower values of max_heart_rate could indicate poorer health, as observed in the violin plots.

The distributions of the feature variables have varying scales, so standardisation will be required for ML purposes. For regression, normalisation might improve outcomes (for this investigation we will however not perform normalisation). Investigation into outliers is recommended, as it might reveal interesting facts, and addressing them could improve model performance.

Section 4: Modelling

Our first objective is to obtain a baseline measure of the strength of association between all the variables and the outcome. For this, we will build a basic Logistic Regression model, without transforming or scaling any of the variables.

We start by splitting the response and the features.

In [15]:
In [16]:

We now build and test a naive logistic regression model - without any transformations or optimisations.
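A sketch of such a naive baseline on synthetic stand-in data (the variable names, split parameters and solver settings are assumptions, as the notebook cell is collapsed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 303-patient, 13-feature matrix.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# No scaling or tuning: this is the deliberately naive baseline.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("AUC:", auc)
```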

In [17]:
AUC: 0.8673611111111111
In [18]:
Normalized confusion matrix

accuracy:			0.825  
precision:			0.799 
sensitivity:			0.867

specificity:			0.783 
negative predictive value:	0.854

false positive rate:		0.217  
false negative rate:		0.133 
false discovery rate:		0.201
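All of the measures above derive from the 2x2 confusion matrix; a small sketch of the arithmetic on hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() unpacks the 2x2 matrix as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
print(sensitivity, specificity, precision)
```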

As can be seen from the accuracy measurements the baseline model performs very well. A C-statistic of 87% on data that has not been scaled or transformed is a very good result. This result confirms our EDA outcomes which showed correlation between age and the continuous variables and the outcome variable. Based on this result it is evident that this correlation is strong for at least a few of the variables.

We will now investigate this correlation further, but first, we will transform and scale the variables to see whether we can improve the accuracy obtained by the naive modelling approach.

Improve on Logistic Regression

We now scale and transform variables to obtain a very basic improvement on the naive model. We will not perform extensive feature engineering or advanced hyperparameter tuning at this stage.
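A minimal sketch of the scaling step with scikit-learn's StandardScaler (exactly which columns the collapsed cell scales is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy continuous columns (e.g. age, cholesterol); the real notebook
# scales all five continuous features.
X = np.array([[63.0, 233.0],
              [37.0, 250.0],
              [41.0, 204.0],
              [56.0, 236.0]])

# StandardScaler centres each column to zero mean and unit variance,
# putting the differently-scaled features on a common footing.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # each column centred on 0
```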

In [19]:

We now build a logistic regression model with data scaled and transformed.

In [20]:
AUC: 0.8770833333333333
In [21]:
Normalized confusion matrix

accuracy:			0.82  
precision:			0.805 
sensitivity:			0.844

specificity:			0.795 
negative predictive value:	0.836

false positive rate:		0.205  
false negative rate:		0.156 
false discovery rate:		0.195

After scaling and transforming the data, we observe a modest improvement in the accuracy of the model. Although the accuracy gain in itself probably does not warrant the transformation and scaling, model convergence has improved by an order of magnitude: the number of iterations required before convergence was previously larger than 1,000,000, and after scaling it dropped to fewer than 100,000. This is an encouraging result, as it shows the model now captures the signal in the data without excessive computation, which will allow us to use more complex models to improve accuracy.

The objective of this study is however not to maximise accuracy, but to find correlation between predictors and the response. Therefore we turn our attention now to study variable correlations.

Improve further on Logistic Regression

We now perform feature selection in order to ascertain whether a smaller, parsimonious model could be built with fewer variables. As per the article by Detrano et al. (1989), this could be useful from a practical perspective: not all healthcare settings have all the variables at their disposal, which would otherwise necessitate the deployment of several complex predictive models, something that is not practical from an operational perspective.

We will first perform correlation and regression tests on the data. These tests are best performed by considering continuous and categoric variables separately due to the intrinsic difference in regression coefficient values for these variables. We will then perform a few numeric methods on the full dataset and compare results.

We start by considering the continuous variables.
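The pairwise Pearson correlations behind the heatmap come from pandas' `.corr()`; a toy sketch with two of the continuous columns (the values are hypothetical):

```python
import pandas as pd

# Toy continuous data; the real analysis uses all 303 rows and all
# five continuous columns.
df = pd.DataFrame({
    "age":            [63, 37, 41, 56, 57],
    "max_heart_rate": [150, 187, 172, 178, 163],
})

# Pearson correlation matrix; age vs max_heart_rate is expected to
# come out negative, as discussed in the text.
print(df.corr())
```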

In [22]:

We see that there is a very strong inverse correlation between maximum heart rate and age. This makes sense, as one's maximum heart rate typically decreases with age. Similarly, there is a strong inverse correlation between max_heart_rate and st_depression. This makes sense, as a lower max_heart_rate is likely to indicate poorer health and could therefore be correlated with a greater st_depression.

We also see that there is a strong positive correlation between maximum heart rate and both cholesterol and resting blood pressure. This is somewhat counterintuitive: high blood pressure and cholesterol are typically indications of poor health, which one would expect to accompany a lower maximum heart rate.

Another observation of interest is the strong correlation between cholesterol and age. These variables could make strong combined predictors for a next iteration of the model.

The first method we use to compare the relative importance of the feature variables is Logistic Regression. We will consider the regression coefficient values for all our continuous variables. Scikit-learn does not implement feature importance measures for logistic regression, so we make use of the statsmodels library's implementation. There is no option for a univariate test, so we first perform a multivariate analysis. We will thereafter make use of the mlxtend library to perform a univariate logistic regression test.

In [23]:
Optimization terminated successfully.
         Current function value: 0.509602
         Iterations 6
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: 0.260     
Dependent Variable: y                AIC:              241.3593  
Date:               2023-01-05 11:01 BIC:              258.4841  
No. Observations:   227              Log-Likelihood:   -115.68   
Df Model:           4                LL-Null:          -156.37   
Df Residuals:       222              LLR p-value:      8.8664e-17
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
age                0.0147   0.0176  0.8334 0.4046 -0.0199  0.0493
rest_blood_press   0.0160   0.0097  1.6516 0.0986 -0.0030  0.0349
cholesterol        0.0050   0.0034  1.4622 0.1437 -0.0017  0.0116
max_heart_rate    -0.0343   0.0066 -5.1736 0.0000 -0.0474 -0.0213
st_depression      0.7812   0.1623  4.8118 0.0000  0.4630  1.0994
=================================================================

Feature:       max_heart_rate	Score:	6.63890
Feature:        st_depression	Score:	5.82521
Feature:     rest_blood_press	Score:	1.00607
Feature:          cholesterol	Score:	0.84262
Feature:                  age	Score:	0.39296

From this analysis it can be seen that the only significant variables are max_heart_rate and st_depression. The remaining variables will be rejected based on their p-values and small coefficients.

Now we perform a univariate comparison between all the features. We use the mlxtend library for this.
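The mlxtend cell is collapsed, so as a hedged stand-in, the same univariate idea can be sketched in plain scikit-learn by cross-validating a one-feature logistic regression per column:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a couple of informative columns.
X, y = make_classification(n_samples=303, n_features=5,
                           n_informative=2, random_state=0)

# Score a one-feature model per column: the best single feature is
# the univariate analogue of the best subset of size 1 reported above.
means = []
for j in range(X.shape[1]):
    scores = cross_val_score(LogisticRegression(), X[:, [j]], y, cv=5)
    means.append(scores.mean())
    print(f"feature {j}: mean accuracy {scores.mean():.3f}")
```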

In [24]:
Best accuracy score: 0.77
Best subset (indices): (12,)
Best subset (corresponding names): ('12',)
Out[24]:
<matplotlib.collections.PolyCollection at 0x1e66f765a60>
Out[24]:
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
12 (12,) [0.7391304347826086, 0.782608695652174, 0.7555... 0.76657 (12,) 0.021374 0.01663 0.008315
2 (2,) [0.782608695652174, 0.782608695652174, 0.82222... 0.757488 (2,) 0.063772 0.049617 0.024809
8 (8,) [0.7391304347826086, 0.7608695652173914, 0.777... 0.748889 (8,) 0.042659 0.03319 0.016595
7 (7,) [0.6739130434782609, 0.7391304347826086, 0.844... 0.727053 (7,) 0.080602 0.062711 0.031356
10 (10,) [0.6304347826086957, 0.7391304347826086, 0.733... 0.709469 (10,) 0.058175 0.045262 0.022631
11 (11,) [0.6956521739130435, 0.7391304347826086, 0.666... 0.709179 (11,) 0.042433 0.033015 0.016507
9 (9,) [0.6304347826086957, 0.6304347826086957, 0.711... 0.696618 (9,) 0.074736 0.058147 0.029074
0 (0,) [0.6304347826086957, 0.717391304347826, 0.6, 0... 0.651787 (0,) 0.067138 0.052236 0.026118
1 (1,) [0.5434782608695652, 0.5652173913043478, 0.577... 0.581739 (1,) 0.104977 0.081676 0.040838
4 (4,) [0.5217391304347826, 0.5217391304347826, 0.577... 0.57314 (4,) 0.068192 0.053055 0.026528
6 (6,) [0.5, 0.6304347826086957, 0.4222222222222222, ... 0.554976 (6,) 0.113244 0.088108 0.044054
3 (3,) [0.6086956521739131, 0.4782608695652174, 0.533... 0.541836 (3,) 0.054058 0.042059 0.02103
5 (5,) [0.43478260869565216, 0.391304347826087, 0.511... 0.449662 (5,) 0.050502 0.039292 0.019646

From this analysis we can see that there are a large number of very strong predictors in this set of variables. thallium_scint scores 77% for accuracy and has the smallest confidence interval. exer_ind_angina and num_major_vessels similarly have high accuracy and small confidence intervals. chest_pain_type and max_heart_rate also have very high accuracy scores.

Next we will however make use of scikit-learn's native feature selection methods, which also allow for univariate tests. The univariate ANOVA test on continuous variables, as implemented in SelectKBest's 'f_classif' scoring function, will be used. Let's see what the results are.
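A sketch of the SelectKBest/f_classif scoring on synthetic data, showing how an informative column separates from noise (the data here is hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
# One informative column, one pure noise column.
informative = rng.normal(size=100)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

# f_classif runs a one-way ANOVA F-test per feature against the labels.
selector = SelectKBest(f_classif, k="all").fit(X, y)
for name, score in zip(["informative", "noise"], selector.scores_):
    print(f"{name}: {score:.3f}")
```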

In [25]:
Feature:       max_heart_rate	Score:	11.81243
Feature:        st_depression	Score:	11.80165
Feature:                  age	Score:	3.80501
Feature:     rest_blood_press	Score:	1.77980
Feature:          cholesterol	Score:	1.20672

What is notable in this analysis is that age shows higher significance here than in the previous test. This is because age and max_heart_rate are cross-correlated, as seen in the Pearson correlation analysis reported earlier in this document. The strong correlation between max_heart_rate and ca_disease diminishes the impact of age in multivariate tests; univariate tests are better suited to this analysis for that reason.

Although age clearly has value as a variable, and in general is good to include in any healthcare regression analysis for the insights it brings, we will exclude it here, as other variables have much greater significance and capture the effect of age sufficiently. We discuss a strategy for including age at a later stage.

We will now consider the categorical variables. Let's see what the results are.

In [26]:
Feature:    num_major_vessels	Score:	13.64256
Feature:       thallium_scint	Score:	12.12447
Feature:      exer_ind_angina	Score:	9.16471
Feature:      chest_pain_type	Score:	3.09327
Feature:             st_slope	Score:	2.11834
Feature:                  sex	Score:	1.14705
Feature:             rest_ecg	Score:	0.87596
Feature:  fasting_blood_sugar	Score:	0.06877

num_major_vessels, thallium_scint and exer_ind_angina are all extremely strong predictors. chest_pain_type, st_slope and sex also contribute to the overall classification. From this analysis the only non-significant variables are rest_ecg and fasting_blood_sugar.

We have now analysed continuous and categorical data separately from a statistical perspective. Before we make the final decision on which variables to drop, we will consider an ML technique for deriving feature importance, i.e. Decision Trees and Random Forests. Unlike the case of regression, we can analyse and draw conclusions on continuous and categorical data together when using these algorithms, as they are insensitive to differences in variable type. Another nice feature of trees is that we don't have to standardise and normalise features, which makes visual analysis a lot more intuitive. We therefore use our initial untransformed dataset for this analysis.
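A sketch of impurity-based importances from a single shallow tree on synthetic stand-in data (the depth and other hyperparameters are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the untransformed 13-feature dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# A shallow tree keeps the rules interpretable; impurity-based
# importances sum to 1 across features, with unused features at 0.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
for j, imp in sorted(enumerate(tree.feature_importances_),
                     key=lambda t: -t[1]):
    print(f"feature {j}: {imp:.4f}")
```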

In [27]:
AUC: 0.6972222222222222
In [28]:
Normalized confusion matrix

accuracy:			0.697  
precision:			0.705 
sensitivity:			0.676

specificity:			0.718 
negative predictive value:	0.689

false positive rate:		0.282  
false negative rate:		0.324 
false discovery rate:		0.295
In [29]:
Feature:       thallium_scint	Score:	0.52965
Feature:      exer_ind_angina	Score:	0.13927
Feature:        st_depression	Score:	0.12318
Feature:                  age	Score:	0.09639
Feature:      chest_pain_type	Score:	0.06617
Feature:          cholesterol	Score:	0.04124
Feature:    num_major_vessels	Score:	0.00410
Feature:             st_slope	Score:	0.00000
Feature:       max_heart_rate	Score:	0.00000
Feature:             rest_ecg	Score:	0.00000
Feature:  fasting_blood_sugar	Score:	0.00000
Feature:     rest_blood_press	Score:	0.00000
Feature:                  sex	Score:	0.00000
In [30]:
Out[30]:

The model has accuracy below 70% (its ROC curve is flatter than the models thus far) and the feature importance results are not very convincing, as many importances are zero. This model needs a bit more work. It is interesting to note that Thallium Scintigraphy comes out very strongly even in this sub-optimal model. We will next look at random forests to see if we can improve on the single tree's accuracy.
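A random-forest sketch on the same kind of synthetic stand-in data; averaging importances over many bootstrapped trees smooths out the zeros seen in a single tree (the forest size is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Each tree sees a bootstrap sample and random feature subsets, so
# importances are spread more evenly than in a single tree.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
print("AUC:", auc)
print("importances:", forest.feature_importances_.round(3))
```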

In [31]:
AUC: 0.7708333333333333
In [32]:
Normalized confusion matrix

accuracy:			0.786  
precision:			0.764 
sensitivity:			0.828

specificity:			0.745 
negative predictive value:	0.812

false positive rate:		0.255  
false negative rate:		0.172 
false discovery rate:		0.236
In [33]:
Feature:       max_heart_rate	Score:	0.12651
Feature:       thallium_scint	Score:	0.12493
Feature:        st_depression	Score:	0.11981
Feature:      chest_pain_type	Score:	0.11469
Feature:                  age	Score:	0.09253
Feature:      exer_ind_angina	Score:	0.09140
Feature:          cholesterol	Score:	0.07834
Feature:     rest_blood_press	Score:	0.07699
Feature:    num_major_vessels	Score:	0.07549
Feature:             st_slope	Score:	0.05451
Feature:                  sex	Score:	0.01734
Feature:             rest_ecg	Score:	0.01635
Feature:  fasting_blood_sugar	Score:	0.01110
In [34]:
Out[34]:

The Random Forest plot is interesting to analyse. Visually one can observe that ca disease (blue nodes) is evenly spread throughout the leaf nodes of the entire tree. A large proportion of the early ca disease nodes occur for individuals with maximum heart rate < 150 and cholesterol >210. From here, if ST depression >0.8 and one is male, around 20% of the overall population is classified as having ca disease. Likewise, a large proportion of the population with max heart rate >150 and chest pain < 3.5 is classified as not having ca disease (orange nodes). Another interesting factor is that Thallium Scintigraphy is reported as the second most important feature. It does however not feature strongly in the Decision Tree. It is likely that strong cross-correlation with other strong features such as maximum heart rate causes the Thallium feature to only surface as a confirmatory feature at lower levels in the tree.

We now build our final Logistic Regression model with the variables selected.

In [35]:
AUC: 0.8694444444444445
In [36]:
Normalized confusion matrix

accuracy:			0.87  
precision:			0.829 
sensitivity:			0.931

specificity:			0.809 
negative predictive value:	0.921

false positive rate:		0.191  
false negative rate:		0.069 
false discovery rate:		0.171

The accuracy results indicate that even though 5 variables were dropped, the model's accuracy did not reduce by a significant amount. We can therefore deploy this model with confidence that it is both robust and accurate.

Compare Logistic regression with Multi-Layer Perceptron (MLP)

We can now build a Multi Layer Perceptron to compare with the Logistic Regression.

Accuracy before model optimisation.

In [37]:
Out[37]:
0.8289473684210527

We now optimise the NN architecture.

In [38]:

Now we optimise the neural network's regularisation parameter.
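A sketch of the regularisation sweep with GridSearchCV over the MLP's alpha (L2 penalty); the architecture and grid values here are assumptions, as the notebook cell is collapsed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data for the tuning sweep.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

# alpha is the L2 regularisation strength on the MLP's weights; the
# grid search picks the value with the best cross-validation accuracy.
grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0),
    param_grid={"alpha": [1e-4, 1e-2, 1.0]},
    cv=3,
)
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
print("best CV accuracy:", round(grid.best_score_, 3))
```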

In [39]:

The alpha value with the highest cross-validation accuracy score, and hence the value we will use, is as follows.

In [40]:

Accuracy after regularisation

In [41]:
Out[41]:
0.8289473684210527
Section 5: Analysis of Results

Plot response curves

In [42]:
In [43]:

Our model is accurate enough to capture the directly proportional relationship between several feature variables (in order of strength of association, based on the response curve output):

  • thallium_scint
  • num_major_vessels
  • st_slope
  • st_depression
  • exer_ind_angina
  • chest_pain_type
  • sex

and the inversely proportional relationship between:

  • max_heart_rate

and the outcome of confirmed Coronary Artery Disease. This is a positive outcome, as it means the model, as applied to the validation dataset, managed to capture the underlying signals in the data. We can therefore conclude that the model generalises well and that its accuracy is sufficiently high for it to be used with the features captured.

This makes sense if one takes into account that the first two variables:

  • thallium_scint: Arteries found to be: 1. Normal 2. Reversible defect and 3. Fixed defect
  • num_major_vessels: Number of major vessels (0-3) coloured by fluoroscopy

are by nature close to the definition of Coronary Artery Disease itself.

Accuracy analysis

In [44]:
AUC: 0.8236111111111112
In [45]:
Normalized confusion matrix

accuracy:			0.842  
precision:			0.808 
sensitivity:			0.897

specificity:			0.787 
negative predictive value:	0.884

false positive rate:		0.213  
false negative rate:		0.103 
false discovery rate:		0.192
Conclusion

Question 1

  • Can data science be used to improve the diagnosis of Coronary Artery Disease by means of using existing data sources?

Given the confidence in the Gold Standard, i.e. Angiography, and the consequences of incorrect diagnosis, it is in my mind unlikely that a test with a sensitivity of approximately 90% or less - the accuracy we managed to attain using data available prior to Angiography and a variety of ML approaches - will be considered as a replacement for Angiography.

Question 2

  • Can data science be used to reduce the number of Angiograms performed in settings where this is problematic?

This analysis identified the 8 most important features to consider which are: thallium_scint, num_major_vessels, st_slope, st_depression, max_heart_rate, exer_ind_angina, chest_pain_type and sex.

An understanding of the factors contributing to a positive Angiogram test would assist clinicians in deciding when an Angiogram might be indicated and what the likely outcome would be. This could assist in early intervention, workup and planning.

The Decision Tree provides useful information as a starting point for a discussion on an algorithm to decide whether an Angiogram is indicated for a particular patient. Further analytic work to assist with such a discussion could be to investigate cut-off points for different age/ sex groups or for populations with different prevalence of disease.

Question 3

thallium_scint, exer_ind_angina, st_depression and max_heart_rate show high accuracy and small confidence intervals when their correlation with CAD is calculated using univariate logistic regression. This is arguably the most accurate method for calculating correlation in this setting for both continuous and categorical data. These variables also feature highly in the SelectKBest and Random Forest variable importance tests and can therefore be considered the most important factors in determining CAD.

Summary

Overall, this analysis has provided valuable insights into the use of data to assist medical practitioners with clinical decisions. Given the similar levels of accuracy that both the Logistic Regression and MLP models attained, it will be up to clinical decision makers to decide on the utility of these approaches for predicting CAD without performing an Angiogram. It is however more likely, in my mind, that predicting CAD in this manner is not feasible, and that the focus should be on using insights into the association of predictors to develop a clinical algorithm that reduces the number of Angiograms performed in a clinical setting.